Abstract. In recent years, we have linked the largest corpus of
formal mathematics with ATP tools, and started to develop combined
AI/ATP systems working in this setting. In this paper we first relate
this project to the earlier large-scale automated developments done by 
Quaife with McCune's Otter system, and to the discussions about the 
QED project about formalizing a significant part of mathematics. Then 
we summarize our adventure so far, argue that the QED dreams were 
right in anticipating the creation of a very interesting semantic AI field, 
and discuss its further research directions. 



1 OTTER and QED 

Twenty years ago, in 1992, Art Quaife's book Automated Development of Funda- 
mental Mathematical Theories [Qua92b] was published. In the conclusion to his 
JAR paper [Qua92a] about the development of set theory Quaife cites Hilbert's
"No one shall be able to drive us from the paradise that Cantor created for us",
and says that: 

The time will come when such crushers as Riemann's hypothesis and 
Goldbach's conjecture will be fair game for automated reasoning pro- 
grams. For those of us who arrange to stick around, endless fun awaits 
us in the automated development and eventual enrichment of the corpus 
of mathematics. 

Quaife's experiments were done using the ATP system that has so far had
perhaps the greatest influence on the field of Automated Theorem Proving: Bill
McCune's Otter. Bill McCune's opinion on using Otter and similar Automated
Reasoning methods for general mathematics was probably more reserved than 
Quaife's. The Otter manual [McC03] (right before acknowledging Quaife's work) 
states:

Some of the first applications that come to mind when one hears "automated
theorem proving" are number theory, calculus, and plane geometry,
because these are some of the first areas in which math students
try to prove theorems. Unfortunately, OTTER cannot do much in
these areas: interesting number theory problems usually require induction,
interesting calculus and analysis problems usually require higher-order
functions, and the first-order axiomatizations of geometry are not
practical.

Yet, Bill McCune was also a part of the QED 3 discussions and workshops
about making a significant part of mathematics computer understandable, verified,
and available for a number of applications. And ATP systems based on
the ideas that were first developed in Otter have now been used for several years
to prove lemmas in general mathematical developments in the large
ATP-translated libraries of Mizar and Isabelle.

This paper summarizes our experience so far with the QED-inspired project
of developing automated reasoning methods for large general computer-understandable
mathematics, particularly in the large Mizar Mathematical Library. Much as Art
Quaife did, we believe (and try to argue below) that automated reasoning in general
mathematics is one of the most exciting research fields, where a number of
new and interesting topics for general AI research emerge today. We hope that
the paper might be of some interest to those QED-dreamers who remember the
great minds of the recently deceased Bill McCune, John McCarthy, and N.G. de
Bruijn.

2 Why Link Large Formal Mathematics with AI/ATP 
Methods? 

The QED Manifesto has the following conservative opinion about the usefulness 
of automated methods to a QED-like project: 

It is the view of some of us that many people who could have easily 
contributed to project QED have been distracted away by the enticing 
lure of AI or AR. It can be agreed that the grand visions of AI or AR are 
much more interesting than a completed QED system while still believing 
that there is great aesthetic, philosophical, scientific, educational, and 
technological value in the construction of the QED system, regardless 
of whether its construction is or is not largely done 'by hand' or largely 
automatically. 

Our opinion is that formalization and automation are two sides of the same 
coin. There are three kinds of benefits in linking formal proof assistants like Mizar 
and their libraries with the Automated Reasoning technology and particularly 
ATPs:

1. The obvious benefits for the proof assistants and their libraries. Automated
Reasoning and AI methods can provide a number of tools and strong methods 
that can assist the formalization, provide advanced search and hint functions, 
and prove lemmas and theorems (semi-)automatically. The QED Manifesto says: 

The QED system we imagine will provide a means by which mathemati- 
cians and scientists can scan the entirety of mathematical knowledge for 
relevant results and, using tools of the QED system, build upon such 
results with reliability and confidence but without the need for minute 
comprehension of the details or even the ultimate foundations of the 
parts of the system upon which they build.

2. The (a bit less obvious) benefits for the field of Automated Reasoning. For 
example, research in automated reasoning over very large libraries is painfully 
theoretical (and practically useless) until such libraries are really available for 
experiments. Mathematicians (and scientists, and other human "reasoners") typ- 
ically know a lot of things about the domains of discourse, and use the knowledge 
in many ways that include many heuristic methods. It thus seems unrealistic 
(and limiting) to develop the automated reasoning tools solely for problems that 
contain only a few axioms, make little use of previously accumulated knowledge, 
and do not attempt to further accumulate and organize the body of knowledge. 
In his 1996 review of Quaife's book, Desmond Fearnley-Sander says: 

The real work in proving a deep theorem lies in the development of 
the theory that it belongs to and its relationships to other theories, the 
design of definitions and axioms, the selection of good inference rules, 
and the recognition and proof of more basic theorems. Currently, no 
resolution-based program, when faced with the stark problem of proving 
a hard theorem, can do all this. That is not surprising. No person can 
either. Remarks about standing on the shoulders of giants are not just 
false modesty... 

3. The benefits for the field of general Artificial Intelligence. These benefits are
perhaps the least mentioned ones 4 ; however, to the authors they appear to be the
strongest long-term motivation for this kind of work. In short, the AI fields of
deductive reasoning and inductive reasoning (represented by machine learning,
data mining, knowledge discovery in databases, etc.) have so far benefited
relatively little from each other's progress. This is an obvious deficiency in comparison
with the human mind, which can both inductively suggest new ideas
and problem solutions based on analogy, memory, statistical evidence, etc., and 
also confirm, adjust, and even significantly modify these ideas and problem so- 
lutions by deductive reasoning and explanation, based on the understanding of 
the world. Repositories of "human thought" that are both large (and thus allow 



4 This may also be due to the frequent feeling of too many unfulfilled promises and too 
high expectations from the general AI, that also led to the current lack of funding 
for projects mentioning Artificial Intelligence. 



the inductive methods), and have precise and deep semantics (and thus allow 
deduction) should be a very useful component for cross-fertilization of these two 
fields. QED-like large formal mathematical libraries are currently the closest ap- 
proximation to such a computer-understandable repository of "human thought" 
usable for these purposes. To be really usable, however, the libraries again have
to be presented in a form that is easy for existing automated
reasoning tools to understand. Fearnley-Sander's quote begun above continues:

... Great theorems require great theories and theories do not, it seems, 
emerge from thin air. Their creation requires sweat, knowledge, imagi- 
nation, genius, collaboration and time. As yet there is not much serious 
collaboration of machines with one another, and we are only just begin- 
ning to see real symbiosis between people and machines in the exercise 
of rationality. 

3 Why Mizar? 

The Mizar proof assistant was chosen by the first author for experiments with 
automated reasoning tools because of its focus on building the large formal Mizar 
Mathematical Library (MML). This formalization effort was started in 1989
by the Mizar team, and its main purpose is to verify a large body of mainstream
mathematics in a way that is close to, and easily understood by, mathematicians,
allowing them to build on this library with proofs from more and more advanced 
mathematical fields. The QED discussions often used Mizar and its library as 
a prototypical example for the project. The particular formalization goals have 
influenced: 

— the choice of a relatively human-oriented formal language in Mizar 

— the choice of the declarative Mizar proof style (Jaskowski-style natural de- 
duction) 

— the choice of first-order logic and set theory as unified common foundations 
for the whole library 

— the focus on developing and using just one human-obvious first-order justi- 
fication rule in Mizar 

— and the focus on making the large library interconnected, usable for more 
advanced formalizations, and using consistent notation. 

There have always been other systems and projects that are similar to Mizar 
in some of the above mentioned aspects. For example, building large and ad- 
vanced formal libraries seems to be more and more common today, probably 
also because of the recent large formalization projects like Flyspeck that require 
a number of previously proved nontrivial mathematical results. In the work that 
is described here, Mizar thus should be considered as a suitable particular choice 
of a system for formalization of mathematics which uses relatively common and 
accessible foundations, and produces a large formal library written in a relatively 
simple and easy-to-understand style. Some of the systems described below actually
already work with data other than Mizar: for example, MaLARea has



already been successfully used for reasoning over problems from the large formal 
SUMO ontology, and for experiments with Isabelle/Sledgehammer problems. 

4 MPTP: Translating Mizar for Automated Reasoning 
tools 

The translation of Mizar to pure first-order logic (MPTP, Mizar Problems for
Theorem Proving) is described in detail in [Urb03,Urb04,Urb07b]. The translation
has to deal with a number of Mizar extensions and practical issues related to 
the Mizar implementation, implementations of first-order ATP systems, and the 
most frequent uses of the translation system. 

The first version (published in early 2003 5 ) has been used for initial ex- 
ploration of the usability of ATP systems on the Mizar Mathematical Library 
(MML). The first important number obtained was the 41% success rate of ATP- 
reproving of about 30000 MML theorems from selected Mizar theorems and 
definitions 6 taken from corresponding MML proofs. 

No previous evidence about the feasibility and usefulness of ATP methods 
on a very large library like MML was available prior to the experiments done 
with MPTP 0.1 7 , sometimes leading to overly pessimistic views on such a project. 
Therefore the goal of this first version was to relatively quickly achieve a "mostly-correctly"
translated version of the whole MML that would make it possible to measure
and assess the potential of ATP methods for this large library. Many shortcuts
and simplifications were therefore taken in this first MPTP version, for example 
direct encoding in the DFG [HKW96] syntax used by the SPASS [WBH+02] 
system, no proof export, incomplete export of some relatively rare constructs 
(structure types and abstract terms), etc. 

Many of these simplifications however made further experiments with MPTP 
difficult or impossible, and also made the 41% success rate uncertain. The lack of 
proof structure prevented measurements of ATP success rate on all internal proof 
lemmas, and experiments with unfolding lemmas with their own proofs. Addi- 
tionally, even if only several abstract terms were translated incorrectly, during 
such proof unfoldings they could spread much wider. Experiments like finding 
new proofs, and cross-verification of Mizar proofs (described below) would suffer
from constant doubt about the possible amount of error caused by the incorrectly 
translated parts of Mizar, and debugging would be very hard. 

Therefore, after the encouraging initial experiments, a new version of MPTP
started to be developed in 2005, requiring first a substantial re-implementation of
Mizar interfaces described in [Urb05]. This version consists of two layers (Mizar-extended
TPTP format processed in Prolog, and a Mizar XML format) that are

5 http://mizar.uwb.edu.pl/forum/archive/0303/msg00004.html

6 Precisely: other Mizar theorems and definitions mentioned in the Mizar proofs. 

7 A lot of work on MPTP has been inspired by the previous work done in the ILF 
project [DW97] on importing Mizar. However it seemed that the project had stopped 
before it could finish the export of the whole MML to ATP problems and provide 
some initial overall statistics of ATP success rate on MML. 



sufficiently flexible and have allowed a number of gradual additions of various 
functions over the past years. The experiments described below are typically 
done on this version (and its extensions). 

5 Experiments and projects based on the MPTP 

MPTP has so far been used for 

— experiments with re-proving Mizar theorems and simple lemmas by ATPs 
from the theorems and definitions used in the corresponding Mizar proofs 

— experiments with fully automated re-proving of Mizar theorems, i.e. the nec- 
essary axioms being selected fully automatically from the whole available 
MML 

— finding new ATP proofs that are simpler than the original Mizar proofs 

— ATP-based cross-verification of the Mizar proofs

— ATP-based explanation of Mizar atomic inferences 

— inclusion of Mizar problems into the TPTP problem library, and unified web 
presentation of Mizar together with the corresponding TPTP problems 

— creation of the MPTP $100 Challenges for reasoning in large theories in 
2006, creation of the MZR category of the CASC Large Theory Batch (LTB) 
competition in 2008, and creation of the MPTP2078 benchmark in 2011 

— testbed for AI systems like MaLARea and MaLeCoP targeted at reasoning
in large theories and combining inductive techniques like machine learning 
with deductive reasoning 

5.1 Re-proving experiments 

As mentioned in Section 4, the initial large-scale experiment done with MPTP 
0.1 indicated that 41% of the Mizar proofs can be automatically found by ATPs, 
if the users provide as axioms to the ATPs the same theorems and definitions 
which are used in the Mizar proofs, plus the corresponding background formulas 
(formulas implicitly used by Mizar, for example to implement type hierarchies). 
As already mentioned, this number was far from certain, e.g., out of the 27449 
problems tried, 625 were shown to be CounterSatisfiable within a relatively low
timelimit given to SPASS (pointing to various oversimplifications taken in MPTP
0.1). This experiment was therefore repeated with MPTP 0.2, however only 
with 12529 problems that come from articles that do not use internal arithmeti- 
cal evaluations done by Mizar. These evaluations were not handled by MPTP 
0.2 at the time of conducting these experiments, being the last (known) part 
of Mizar that could be blamed for possible ATP incompleteness. The E prover 
version 0.9 and SPASS version 2.1 were used for this experiment, with 20s time 
limit (due to limited resources). The results (reported in [Urb07b]) are given in 
Table 1. 39% of the 12529 theorems were proved by either SPASS or E, and no 
countersatisfiability was found. 
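
The shape of such a re-proving problem can be illustrated with a minimal sketch. It assumes the problem is written in the TPTP FOF syntax; the formula names (`type_hier`, `t1`, `t2`), the predicates, and the helper functions are hypothetical and are not taken from MPTP itself. Background formulas and the premises used in the Mizar proof become axioms, and the theorem becomes the conjecture.

```python
def fof(name, role, formula):
    """Format one annotated formula in TPTP FOF syntax."""
    return f"fof({name}, {role}, ( {formula} ))."


def build_problem(conjecture, premises, background):
    """Assemble a problem: background formulas, then premises, then the conjecture.

    Each argument item is a (name, formula) pair; all names are illustrative.
    """
    lines = [fof(name, "axiom", formula) for name, formula in background]
    lines += [fof(name, "axiom", formula) for name, formula in premises]
    name, formula = conjecture
    lines.append(fof(name, "conjecture", formula))
    return "\n".join(lines)


# Hypothetical toy input: a type-hierarchy fact stands in for a background formula.
background = [("type_hier", "! [X] : (real(X) => complex(X))")]
premises = [("t1", "! [X] : (complex(X) => has_re(X))")]
conjecture = ("t2", "! [X] : (real(X) => has_re(X))")

problem = build_problem(conjecture, premises, background)
print(problem)
```

The resulting text could be passed to any prover that reads TPTP input; the sketch deliberately leaves out how the premises and background formulas are extracted from a Mizar article.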

These results have thus to a large extent confirmed the optimistic outlook of 
the first measurement in MPTP 0.1. In later experiments, this ATP performance 



Table 1. Reproving of the theorems from non-numerical articles by MPTP 0.2 in 2005

description  proved  countersatisfiable  timeout or memory out  total
E 0.9          4309                                       8220  12529
SPASS 2.1      3850                                       8679  12529
together       4854                                       7675  12529



has been steadily going up; see Table 2 for results from a 2007 run with a 60s timelimit.
This is a result of better pruning of redundant axioms in MPTP, and also
of ATP development, which obviously was influenced by the inclusion of MPTP
problems into the TPTP library, forming a significant part of the FOF problems
in the CASC competition since 2006. The newer versions of E and SPASS together
solved 6500 problems in this increased timelimit, i.e. 52% of them all. With the addition
of Vampire and its customized Fampire version (which alone solves 51% of
the problems), the combined success rate went up to 7694 of these problems, i.e.
to 61%. The caveat is that the methods for dealing with arithmetic are becoming
stronger and stronger in Mizar, and it is so far not clear how to efficiently
handle them in ATPs. The MPTP problem creation for problems containing
arithmetic is thus currently quite crude, and the ATP success rate on such
problems will likely be significantly lower than on the nonarithmetical ones.
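
The percentages quoted in this subsection follow directly from the proved counts in Tables 1 and 2 over the 12529 problems. The short check below recomputes them; the dictionary labels and the helper name are only illustrative.

```python
TOTAL = 12529  # theorems from the non-numerical articles

# Proved counts as reported in Tables 1 and 2.
proved = {
    "E 0.9 + SPASS 2.1 (2005, 20s)": 4854,
    "E 0.999 + SPASS 2.2 (2007, 60s)": 6500,
    "Fampire 9 alone (2007, 60s)": 6411,
    "all systems together (2007, 60s)": 7694,
}


def success_rate(count, total=TOTAL):
    """Return the success rate rounded to a whole-number percentage."""
    return round(100 * count / total)


rates = {name: success_rate(count) for name, count in proved.items()}
for name, rate in rates.items():
    print(f"{name}: {rate}%")
```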

Table 2. Reproving of the theorems from non-numerical articles by MPTP 0.2 in 2007

description       proved  countersatisfiable  timeout or memory out  total
E 0.999             5661                                       6868  12529
SPASS 2.2           5775                                       6754  12529
E+SPASS together    6500                                             12529
Vampire 8.1         5110                                       7419  12529
Vampire 9           5330                                       7119  12529
Fampire 9           6411                                       6118  12529
all together        7694                                             12529



5.2 Finding new proofs and the AI aspects 

MPTP 0.2 was also used to try to prove Mizar theorems fully automatically, i.e., 
the choice of premises for each theorem was done automatically, and all previ- 
ously proved theorems were eligible. Because giving ATPs thousands of axioms 
is usually hopeless 8 , the axiom selection was done by symbol-based machine 
learning from the previously available proofs. The results (reported in [Urb07b]) 
are given in Table 3. 2408 of the 12529 theorems were proved either by E 0.9

8 This is changing as we go: the CASC-LTB category has already sparked interest in 
ATP systems dealing efficiently with large numbers of axioms, see [Urb11] for a brief
overview of the large theory methods developed so far. 



or SPASS 2.1 from the axioms selected by the machine learner, the combined 
success rate of this whole system was thus 19%. 

Table 3. Proving new theorems with machine learning support by MPTP 0.2 in 2005

description  proved  countersatisfiable  timeout or memory out  total
E 0.9          2167                                      10362  12529
SPASS 2.1      1543                                      10986  12529
together       2408                                      10121  12529



This experiment demonstrates a very real and quite unique benefit of large 
formal mathematical libraries for conducting novel integration of AI methods. 
As the machine learner is trained on previous proofs, it recommends relevant 
premises from the large library that (according to the past experience) should 
be useful for proving new conjectures. A variety of machine learning methods 
(neural nets, Bayes nets, decision trees, nearest neighbor, etc.) can be used for 
this, and their performance evaluated in the standard machine learning way, 
i.e., by looking at the actual axiom selection done by the human author in 
the Mizar proof, and comparing it with the selection suggested by the trained 
learner. However, what if the machine learner is sometimes more clever than 
the human, and suggests a completely different (and perhaps better) selection 
of premises, leading to a different proof? In such a case, the standard machine 
learning evaluation (i.e. comparison of the two sets of premises) will say that the 
two sets of premises differ too much, and thus the machine learner has failed. This 
is considered acceptable for machine learning, as in general, there is no deeper 
concept of truth available, there are just training and testing data. However in 
our domain we do have a way to show that the trained learner was right
(and possibly smarter than the human): we can run an ATP system on its axiom 
selection. If a proof is found, it provides a much stronger measure of correctness. 
Obviously, this is only true if we know that the translation from Mizar to TPTP 
was correct, i.e., conducting such experiments really requires that we take extra 
care to ensure that no oversimplifications were made in this translation. 
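
The contrast between the two evaluation styles can be made concrete with a small sketch. It is not the learner used in the reported experiments: it is a deliberately simple symbol co-occurrence ranker, and all premise names and symbols in the toy data are hypothetical. The `premise_overlap` function is the standard machine learning measure (Jaccard overlap with the human selection). The stronger check described above would instead run an ATP on the suggested premises, a step that is omitted here.

```python
from collections import defaultdict


def train(proofs):
    """Count how often each premise was used in proofs of conjectures containing a symbol.

    proofs: iterable of (conjecture_symbols, premises_used) pairs.
    """
    weight = defaultdict(lambda: defaultdict(int))  # symbol -> premise -> count
    for symbols, premises in proofs:
        for symbol in symbols:
            for premise in premises:
                weight[symbol][premise] += 1
    return weight


def rank_premises(weight, symbols, limit):
    """Return the `limit` premises with the highest summed counts for these symbols."""
    score = defaultdict(int)
    for symbol in symbols:
        for premise, count in weight.get(symbol, {}).items():
            score[premise] += count
    ranked = sorted(score, key=lambda p: (-score[p], p))  # ties broken by name
    return ranked[:limit]


def premise_overlap(suggested, human):
    """Jaccard overlap between the suggested and the human premise selection."""
    suggested, human = set(suggested), set(human)
    return len(suggested & human) / len(suggested | human)


# Hypothetical training data: symbols of earlier theorems and the premises of their proofs.
proofs = [
    ({"lim", "Re", "seq"}, ["def_lim", "re_lemma"]),
    ({"lim", "Im", "seq"}, ["def_lim", "im_lemma"]),
    ({"Re", "Im", "complex"}, ["re_im_split"]),
]
weight = train(proofs)
suggested = rank_premises(weight, {"lim", "Re", "Im"}, limit=3)
overlap = premise_overlap(suggested, ["def_lim", "re_lemma"])
print(suggested, overlap)
```

A low overlap value only says that the suggestion differs from the human choice; whether the suggestion is actually sufficient can be decided only by attempting the proof.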

In the above mentioned experiment, 329 of the 2408 (i.e. 14%) proofs
found by ATPs were shorter (used fewer premises) than the original MML proof.
An example of such proof shortening is discussed in [Urb07b], showing that the
newly found proof is really valid. Instead of arguing from the first principles 
(definitions) like in the human proof, the combined inductive-deductive system 
was smart enough to find a combination of previously proved lemmas (properties) 
that justify the conjecture more quickly. 

A similar newer evaluation is done on the whole MML in [AKU12], comparing
the original MML theory graph with the theory graph for the 9141 automatically
found proofs. An illustrative example from there is theorem COMSEQ_3:40 9 ,

9 http://mizar.cs.ualberta.ca/~mptp/cgi-bin/browserefs.cgi?refs=t40_comseq_3



proving the relation between the limit of a complex sequence and its real and 
imaginary parts: 

Theorem 1. Let $(c_n) = (a_n + i b_n)$ be a convergent complex sequence. Then $(a_n)$
and $(b_n)$ converge and $\lim a_n = \mathrm{Re}(\lim c_n)$ and $\lim b_n = \mathrm{Im}(\lim c_n)$.

The convergence of $(a_n)$ and $(b_n)$ was done the same way by the human formalizer
and the ATP. The human proof of the limit equations proceeds by looking
at the definition of a complex limit, expanding the definitions, and proving that
$a$ and $b$ satisfy the definition of the real limit (finding a suitable $n$ for a given
$\varepsilon$). The AI/ATP just notices that this kind of groundwork was already done in
a "similar" case COMSEQ_3:39 10 , which says that:

Theorem 2. If $(a_n)$ and $(b_n)$ are convergent, then $\lim c_n = \lim a_n + i \lim b_n$.

And it also notices the "similarity" (algebraic simplification) provided by COMPLEX1:28 11 :

Theorem 3. $\mathrm{Re}(a + ib) = a \land \mathrm{Im}(a + ib) = b$

Such (automatically found) manipulations can be used (if noticed!) to avoid the 
"hard thinking" about the epsilons in the definitions. 

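Written out, the shortcut amounts to the following chain of equalities. This is a reconstruction from Theorems 2 and 3 as stated above, not the literal ATP proof, and it assumes that the convergence of $(a_n)$ and $(b_n)$ has already been established, so that Theorem 2 applies and both limits are real numbers.

```latex
\begin{align*}
\lim c_n &= \lim a_n + i \lim b_n
  && \text{(Theorem 2)} \\
\mathrm{Re}(\lim c_n) &= \mathrm{Re}(\lim a_n + i \lim b_n) = \lim a_n
  && \text{(Theorem 3 with } a = \lim a_n,\ b = \lim b_n\text{)} \\
\mathrm{Im}(\lim c_n) &= \mathrm{Im}(\lim a_n + i \lim b_n) = \lim b_n
  && \text{(Theorem 3 with the same } a,\ b\text{)}
\end{align*}
```
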
5.3 ATP-based explanation, presentation, and cross-verification of 
Mizar proofs 

While the whole proofs of Mizar theorems can be quite hard for ATP systems, 
re-proving the Mizar atomic justification steps (called Simple Justifications in 
Mizar) turns out to be quite easy for ATPs. The combinations of E and SPASS 
usually solve more than 95% of such problems, and with smarter automated 
methods for axiom selection a 99.8% success rate (14 unsolved problems from
6765) was achieved in [US07]. This makes it practical to use ATPs for expla- 
nation and presentation of the (not always easily understandable) Mizar sim- 
ple justifications, and to construct larger systems for independent ATP-based 
cross-verification of (possibly very long) Mizar proofs. In [US07] such a cross- 
verification system is presented, using the GDV [Sut06] system which was ex- 
tended to process Jaskowski-style natural deduction proofs that make frequent 
use of assumptions (suppositions). MPTP was used to translate Mizar proofs 
to this format, and GDV together with the E, SPASS, and MaLARea systems
were used to automatically verify the structural correctness of proofs,
and 99.8% of the proof steps needed for the 252 Mizar problems selected for 
the MPTP Challenge (see below). This provides the first practical method 
for independent verification of Mizar, and opens the possibility of importing 
Mizar proofs into other proof assistants. A web presentation allowing interac- 
tion with ATP systems and GDV verification of Mizar proofs has been set up 
at http://www.tptp.org/MizarTPTP (described in [UTSP07]), and an online 
service [URS11] integrating the ATP functionalities has been built. 12

10 http://mizar.cs.ualberta.ca/~mptp/cgi-bin/browserefs.cgi?refs=t39_comseq_3
11 http://mizar.cs.ualberta.ca/~mptp/cgi-bin/browserefs.cgi?refs=t28_complex1
12 http://mws.cs.ru.nl/~mptp/MizAR.html, http://mizar.cs.ualberta.ca/~mptp/MizAR.html



5.4 Use of MPTP for ATP challenges and competitions 

The first MPTP problems were included into the TPTP library in 2006, and 
were already used for the 2006 CASC competition. In 2006, the MPTP $100 
Challenges 13 were created and announced. This is a set of 252 related large-theory
problems needed for one half (one of two implications) of the Mizar proof
of the general topological Bolzano-Weierstrass theorem. Unlike the CASC competition,
the challenge had an overall timelimit (252 * 5 minutes = 21 hours) for
solving the problems, allowing complementary techniques like machine learning 
from previous solutions to be experimented with transparently in runtime. The 
challenge was won a year later by the leanCoP [OB03] system, having already 
revealed several interesting approaches to ATP in large theories: goal-directed 
calculi like connection tableaux (used in leanCoP), model-based axiom selection 
(used e.g. in SRASS [SP07]), and machine learning of axiom relevance (used 
in MaLARea). The MPTP Challenge problems were again included into TPTP 
and used for the standard CASC competition in 2007. In 2008, the CASC-LTB
(Large Theory Batch) category appeared for the first time with a setting similar
to the MPTP Challenges, and additional large-theory problems from the Cyc
and SUMO ontologies. A set of 245 relatively hard Mizar problems was included
in TPTP for this purpose, coming from the most advanced parts of the Mizar
library. The problems come in four versions, containing different numbers of the
previously available MML theorems and definitions as axioms. The largest versions
thus contain over 50000 axioms. An updated larger version (MPTP2078)
of the MPTP Challenge benchmark was developed in 2011 [AKT+11], consist- 
ing of 2078 interrelated problems in general topology, and making use of precise 
dependency analysis of the MML for constructing the easy versions of the prob- 
lems. 

5.5 Development of larger AI metasystems like MaLARea and 
MaLeCoP on MPTP data 

In Section 5.2, it is explained how the deeply defined notion of mathematical 
truth (implemented through ATPs) can improve the evaluation of learning sys- 
tems working on large semantic knowledge bases like translated MML. This 
is however only one part of the AI fun made possible by such large libraries 
being available to ATPs. Another part is that the newly found proofs can be 
recycled, and again used for learning in such domains. This closed loop (see 
Figure 1) between using deductive methods to find proofs, and using induc- 
tive methods to learn from the existing proofs and suggest new proof direc- 
tions, is the main idea behind the Machine Learner for Automated Reasoning 
(MaLARea [Urb07a,USPV08]) metasystem, which turns out to have by a large 
margin the best performance on large theory benchmarks like the MPTP Challenge
and MPTP2078. There are many kinds of information that such an autonomous
metasystem can try to use and learn. The second version of MaLARea
already also uses structural and semantic features of formulas for their characterization
and for improving the axiom selection.

Fig. 1. The basic MaLARea loop.

MaLARea can work with arbitrary ATP backends (E and SPASS by default), 
however, the communication between learning and the ATP systems is high-level:
The learned relevance is used to try to solve problems with varied limited
numbers of the most relevant axioms. Successful runs provide additional data for
learning (useful for solving related problems), while unsuccessful runs can yield
countermodels, which can be re-used for semantic pre-selection and as additional
input features for learning. An advantage of such a high-level approach is that it
gives a generic inductive (learning)/deductive (ATP) metasystem to which any
ATP can be easily plugged as a blackbox. Its disadvantage is that it does not
attempt to use the learned knowledge for guiding the ATP search process once
the axioms are selected.
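
The closed loop can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the MaLARea implementation: the ATP is an arbitrary blackbox function `run_atp`, relevance is learned by plain symbol co-occurrence counting, countermodels are ignored, and all problem and axiom names are hypothetical.

```python
from collections import defaultdict


def rank(axioms, relevance, symbols):
    """Order axioms by learned relevance, then by symbol overlap, then by name."""
    def key(axiom):
        learned = sum(relevance[s][axiom] for s in symbols if s in relevance)
        overlap = len(axioms[axiom] & symbols)
        return (-learned, -overlap, axiom)
    return sorted(axioms, key=key)


def malarea_loop(problems, axioms, run_atp, limits=(2, 4), max_iterations=5):
    """Alternate ATP runs and learning until no new problem is solved.

    problems, axioms: dicts mapping a name to its set of symbols.
    run_atp(problem, premises): blackbox returning the set of premises used, or None.
    """
    proofs = {}
    relevance = defaultdict(lambda: defaultdict(int))  # symbol -> axiom -> count
    for _ in range(max_iterations):
        newly_solved = 0
        for name, symbols in problems.items():
            if name in proofs:
                continue
            ranked = rank(axioms, relevance, symbols)
            for limit in limits:  # try varied numbers of the most relevant axioms
                used = run_atp(name, ranked[:limit])
                if used is not None:
                    proofs[name] = used
                    for s in symbols:  # learn from the new proof
                        for a in used:
                            relevance[s][a] += 1
                    newly_solved += 1
                    break
        if newly_solved == 0:
            break
    return proofs


# Toy blackbox "ATP": succeeds when all premises a problem needs are supplied.
needed = {"p1": {"ax_a"}, "p2": {"ax_a", "ax_b"}}
log = []

def toy_atp(problem, premises):
    log.append((problem, tuple(premises)))
    return set(needed[problem]) if needed[problem] <= set(premises) else None

axioms = {"ax_a": {"g"}, "ax_b": {"h"}, "ax_c": {"h"}}
problems = {"p2": {"f", "h"}, "p1": {"f", "g"}}
proofs = malarea_loop(problems, axioms, toy_atp, limits=(1, 2))
print(proofs)
```

In this toy run the problem `p2` fails in the first pass, because `ax_a` shares no symbol with it and is ranked last. After `p1` is solved with `ax_a`, the shared symbol `f` raises the relevance of `ax_a`, and `p2` is solved in the second pass.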

Hence the logical next step done in the Machine Learning Connection Prover 
(MaLeCoP) prototype [UVS11]: the learned knowledge is used for guiding proof 
search inside a theorem prover (leanCoP in this case). MaLeCoP follows a general 
advising design that is as follows (see also Figure 2): The theorem prover (P) has 
a sufficiently fast communication channel to a general advisor (A) that accepts 
queries (proof state descriptions) and training data (characterization of the proof 



state 14 together with solutions 15 and failures) from the prover, processes them, 
and replies to the prover (advising, e.g., which clauses to choose). The advisor 
A also talks to external (in our case learning) system(s) (E). A translates the 
queries and information produced by P to the formalism used by a particular E, 
and translates E's guidance back to the formalism used by P. At suitable time, 
A also hands over the (suitably transformed) training data to E, so that E can 
update its knowledge of the world on which its advice is based. 
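
The division of labour between P, A, and E can be illustrated with a minimal sketch. It is an assumption-laden toy, not MaLeCoP's code: the external system is a simple co-occurrence counter rather than a real learner, proof states are characterized by the predicate symbols on the current branch, and all class, method, and clause names are hypothetical.

```python
from collections import Counter, defaultdict


class ExternalLearner:
    """E: a stand-in learning system working on (features, label) examples."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # feature -> clause label -> count

    def update(self, examples):
        for features, label in examples:
            for feature in features:
                self.counts[feature][label] += 1

    def predict(self, features, labels):
        score = {l: sum(self.counts[f][l] for f in features) for l in labels}
        return sorted(labels, key=lambda l: (-score[l], l))  # ties broken by name


class Advisor:
    """A: translates between the prover's proof states and the learner's formalism."""

    def __init__(self, learner):
        self.learner = learner
        self.pending = []  # training data not yet handed over to E

    @staticmethod
    def _features(branch_literals):
        # Characterize a proof state by the predicate symbols on the branch.
        return {lit.lstrip("~").split("(")[0] for lit in branch_literals}

    def advise(self, branch_literals, candidate_clauses):
        """Answer a query from the prover: which clause to try first."""
        return self.learner.predict(self._features(branch_literals), candidate_clauses)

    def report(self, branch_literals, clause_used):
        """Collect training data (proof state plus the clause that worked)."""
        self.pending.append((self._features(branch_literals), clause_used))

    def flush(self):
        """Hand the collected training data over to E at a suitable time."""
        self.learner.update(self.pending)
        self.pending = []


# The prover P is represented here only by the calls it would make.
advisor = Advisor(ExternalLearner())
before = advisor.advise(["p(c)"], ["c1", "c2"])
advisor.report(["~p(a)", "q(b)"], "c2")
advisor.flush()
after = advisor.advise(["p(c)"], ["c1", "c2"])
print(before, after)
```

Keeping the translation in the advisor means that the prover and the learner do not need to know each other's data formats, which is the point of the general design described above.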



Fig. 2. The General Architecture used for MaLeCoP

MaLeCoP is a very recent work, which has so far revealed interesting issues
in using detailed smart guidance in large theories. Even though naive Bayes is a 
comparatively fast learning and advising algorithm, in a large theory it turned 
out to be about 1000 times slower than a primitive tableaux extension step. 

14 instantiated, e.g., as the set of literals/symbols on the current branch 

15 instantiated, e.g., as the description of clauses used at particular proof states 



So a number of strategies had to be defined that use the smart guidance only
at the crucial points of the proof search. Even with such limits, the preliminary 
evaluation done on the MPTP Challenge already showed an average proof search 
shortening by a factor of 20 in terms of the number of tableaux inferences. 

There are a number of development directions for knowledge-based AI/ATP
architectures like MaLARea and MaLeCoP. Extracting lemmas from proofs and 
adding them to the set of available premises, creating new interesting conjectures 
and defining new useful notions, finding optimal strategies for problem classes, 
faster guiding of the internal ATP search, inventing policies for efficient governing 
of the overall inductive-deductive loop: all these are interesting AI tasks that 
become relevant in this large-theory setting, and that seem to be highly relevant 
for the ultimate task of doing mathematics and perhaps even generally thinking 
automatically. A particularly interesting research issue is the following. 

Consistency of Knowledge and Its Transfer: Probably the most impor- 
tant research topic in the emerging AI approaches to large-theory automated 
reasoning is the issue of consistency of knowledge and its transfer. In an unpub- 
lished experiment with MaLARea in 2007 over a set of problems exported by an 
early Isabelle/Sledgehammer version, MaLARea quickly solved all of the prob- 
lems, even though some of them were supposed to be hard. Larry Paulson has 
tracked the problem to an (intentional) simplification in the first-order encoding 
of Isabelle/HOL types, which typically raises the overall ATP success rate (after 
checking in Isabelle the imported proofs). Once the inconsistency originating 
from the simplification was however found by the guiding AI system, MaLARea 
focused on fully exploiting it even in problems where such inconsistency would 
be ignored by standard ATP search. 

An opposite phenomenon happened recently in experiments with a clausal 
version of MaLARea. The CNF form introduces a large number of new skolem 
symbols that make similar problems and formulas look different after clausification 
(despite the fact that the skolemization tries hard to use the same 
symbol whenever it can), and the AI guidance based on symbols and terms deteriorates. 
The same happens with the AI guidance based on models of formulas 
(generated by Mace and Paradox), because disjoint skolem symbols prevent 
a straightforward evaluation (using the LADR clausefilter utility) of many 
clauses in models that are found for differently named skolem functions. The inability 
of the AI guidance to obtain and use the information about the similarity 
of the clauses results in about 100 fewer problems solved (700 vs. 800) in the first 
ten MaLARea iterations over the MPTP2078 benchmark. 
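
The effect of differently named skolem symbols on symbol-based similarity can be seen in a minimal sketch. Everything here is an assumption made for illustration: the clause strings, the skolem naming pattern (`esk<N>_<arity>` or `sk<N>`), and the placeholder-based normalization are hypothetical and are not the mechanism used by any particular clausifier or by MaLARea.

```python
import re

# Assumed skolem naming pattern, e.g. "esk1_2" or "sk3" (illustrative only).
SKOLEM = re.compile(r"^(esk|sk)\d+(_\d+)?$")


def symbols(clause):
    """Collect symbols of a clause string.

    Follows the TPTP convention that symbols start with a lower-case
    letter and variables with an upper-case letter.
    """
    return set(re.findall(r"\b[a-z][A-Za-z0-9_]*", clause))


def normalize(syms):
    """Map every skolem symbol to one shared placeholder.

    This is a deliberately crude remedy: it restores overlap between
    clauses but also discards which skolem function was meant.
    """
    return {"SKOLEM" if SKOLEM.match(s) else s for s in syms}


def jaccard(a, b):
    """Jaccard similarity of two symbol sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two clauses that differ only in the name of the skolem function.
c1 = "subset(X, Y) | in(esk1_2(X, Y), X)"
c2 = "subset(X, Y) | in(esk7_2(X, Y), X)"

raw_similarity = jaccard(symbols(c1), symbols(c2))
normalized_similarity = jaccard(normalize(symbols(c1)), normalize(symbols(c2)))
```

With raw symbol sets the two clauses share only `subset` and `in` out of four distinct symbols, whereas after the placeholder normalization their symbol sets coincide. A learner that sees only the raw sets therefore treats structurally identical clauses as merely half-similar.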

Hence a trade-off: smaller pieces of knowledge (like clauses) allow better 
focus, but techniques like skolemization can destroy some explicit similarities 
useful for learning. Designing suitable representations and learning methods on 
top of the knowledge is therefore very significant for the final performance, while 
inconsistent representations can be fatal. Using CNF and its various alternatives 
and improvements has been a topic discussed many times in the ATP community 
(also for example by Quaife in his book). Here we note that it is not just the 
low-level ATP algorithms that are influenced by such choices of representation, 
but the problem extends to and significantly influences also the performance of 
high-level heuristic guidance methods in large theories. 

6 Future QED-like Directions 

There is a large amount of work to be done on practically all the projects 
mentioned above. The MPTP translation is by no means optimal (and especially 
proper encoding of arithmetic needs more experiments and work). Import of 
ATP proofs to Mizar practically does not exist (there is a basic translator tak- 
ing Otter proof objects to Mizar, however this is very distant from the readable 
proofs in MML). With sufficiently strong ATP systems, the cross-verification 
of the whole MML could be attempted, and work on import of such detailed 
ATP proofs into other proof assistants could be started. The MPTP handling of 
second-order constructs is in some sense incomplete, and either a translation to 
(finitely axiomatized) NBG set theory (used by Quaife), or usage of higher-order 
ATPs would be interesting from this point of view. 

More challenges and interesting presentation tools can be developed, for ex- 
ample an ATP-enhanced wiki for Mizar is an interesting QED-like project that 
is now being worked on [UARG10]. The heuristic and machine learning methods, 
and combined AI metasystems, have a very long way to go; some future directions 
are mentioned above. This is no longer only about mathematics: all kinds 
of more or less formal large knowledge bases are becoming available in other 
sciences, and automated reasoning could become one of the strongest methods 
for general reasoning in sciences when a sufficient amount of formal knowledge 
exists. Strong ATP methods for large formal mathematics could also provide 
useful semantic filtering for larger systems for automatic formalization of math- 
ematical papers. This is a field that has been so far deemed to be rather science 
fiction than a real possibility; 16 however, heuristic AI methods used for knowledge 
search and machine translation are becoming more and more mature, and in con- 
junction with strong ATP methods they could provide a basis for a large-scale 
(semi-)automated QED project. 